-
-
Notifications
You must be signed in to change notification settings - Fork 133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
test: arrow_path_to_parquet
#1239
Conversation
WalkthroughThe changes introduce a new implementation of the Changes
Possibly related PRs
Suggested labels
Suggested reviewers
Poem
✨ Finishing Touches
🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/parseable/streams.rs (1)
94-106
: Improve error handling inarrow_path_to_parquet
function.The function correctly extracts the front part from the arrow file name using the regex pattern, but it uses
unwrap()
twice on line 96, which could panic if the path has no filename or the filename is not valid UTF-8.Consider improving the error handling:
fn arrow_path_to_parquet(path: &Path, random_string: &str) -> Option<PathBuf> { - let filename = path.file_name().unwrap().to_str().unwrap(); + let filename = path.file_name()?.to_str()?; let filename = ARROWS_NAME_STRUCTURE .captures(filename) .and_then(|c| c.get(1))? .as_str(); let filename_with_random_number = format!("{filename}.data.{random_string}.parquet"); let mut parquet_path = path.to_owned(); parquet_path.set_file_name(filename_with_random_number); Some(parquet_path) }This makes the function more robust by using the
?
operator to propagate theNone
value if either operation fails.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/parseable/streams.rs
(5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: coverage
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-apple-darwin
🔇 Additional comments (4)
src/parseable/streams.rs (4)
36-45
: Good addition of necessary imports.The imports for
once_cell::sync::Lazy
andregex::Regex
are appropriate for implementing the regex-based file name validation solution.
224-231
: Improved arrow file filtering using regex.The updated filtering logic using the regex pattern provides more robust validation of arrow filenames compared to the previous simple extension check.
274-281
: Good error handling with warning for unexpected arrow files.Adding a warning log when an arrow file doesn't match the expected pattern is a good practice. It will help identify problematic files during operation.
73-92
:❓ Verification inconclusive
Well-documented regex pattern for arrow file name validation.
The static regex pattern with detailed documentation explains the expected format, limitations, and gives examples, which is excellent for maintainability.
One consideration: the pattern uses
\S+
for the front part which allows any non-whitespace character, but the documentation mentions limitations on partition keys/values. Consider if the regex should be more restrictive to enforce the documented limitations.
🏁 Script executed:
#!/bin/bash # Check if there are any existing arrow files that might not match the new pattern # This will help identify if the regex might be too restrictive for existing files # Find current arrow files echo "Searching for existing arrow files..." find . -name "*.arrows" | grep -v "node_modules" | head -n 10 # Test a few example filenames against the regex echo -e "\nTesting regex against sample filenames:" test_regex() { if [[ "$1" =~ ^[a-zA-Z0-9&=]+\.(.*?)\.data\.arrows$ ]]; then echo "MATCH: $1" else echo "NO MATCH: $1" fi } # Valid examples test_regex "abc123.date=2023-01-21.hour=10.minute=30.key1=value1.key2=value2.ee529ffc8e76.data.arrows" test_regex "20200201T1830f8a5fc1edc567d56&key1=value1&key2=value2.date=2020-01-21.hour=10.minute=30.key1=value1.key2=value2.ee529ffc8e76.data.arrows" # Invalid examples test_regex "abc-123.date=2023-01-21.hour=10.minute=30.key1=value1.key2=value2.ee529ffc8e76.data.arrows" # Invalid character in schema_key test_regex "abc123.date=2023-01-21.hour=10.minute=30.key1=value-1.key2=value2.ee529ffc8e76.data.arrows" # Has hyphen in valueLength of output: 374
Attention: Refine Regex Restrictions & Address Shell Test Syntax
The documentation for the arrow file regex is clear and detailed, which is excellent for maintainability. However, two points need your attention:
Regex Specificity: The regex currently uses
\S+
to capture the "front" part, which permits any non-whitespace characters. This does not enforce the documented limitations that partition keys and values must be strictly alphanumeric. Please consider updating the regex to use a more precise character class that only allows alphanumeric characters (and specific allowed punctuation) as described.Shell Test Script Issue: The test script encountered a syntax error due to the unescaped ampersand (
&
) within the regex. When running regex tests in bash, special characters like&
should be properly escaped or quoted to prevent shell misinterpretation. This adjustment will ensure reliable test execution.Please address these items to align the implementation with the documentation and improve test reliability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/parseable/streams.rs
(6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: coverage
🔇 Additional comments (5)
src/parseable/streams.rs (5)
36-36
: Appropriate imports added.The imports for
once_cell::sync::Lazy
andregex::Regex
are correctly added to support the new regex-based file name extraction functionality.Also applies to: 45-45
73-92
: Good documentation and regex pattern implementation.The documentation thoroughly explains the format, limitations, and provides examples for the arrow file name structure. Using
Lazy::new()
for initializing the static regex is a good practice that ensures the regex is compiled only once when needed.The regex pattern looks correct for validating the described format:
- Schema key (alphanumeric with
&
and=
characters)- Front part (for parquet file naming)
- Suffix
.data.arrows
227-231
: Improved file filtering with regex validation.Replacing the simple extension check with regex validation improves the robustness of the
arrow_files
method by ensuring that files follow the expected naming pattern.
274-281
: Added warning for unexpected arrow files.Good improvement to log warnings for arrow files that don't match the expected pattern. This will help with debugging file naming issues.
1234-1296
: Comprehensive test coverage for the new functionality.The tests cover a good range of scenarios including:
- Valid arrow path conversion
- Invalid arrow paths (missing suffix)
- Invalid schema keys with special characters
- Complex nested paths
- Edge cases with empty front part
These tests will help ensure the robustness of the implementation.
arrow_path_to_parquet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/parseable/streams.rs
(5 hunks)
🔇 Additional comments (3)
src/parseable/streams.rs (3)
198-200
: Improved file filtering approachGood use of the
ARROW_FILE_EXTENSION
constant and the more conciseis_some_and
method for filtering. This makes the code more maintainable and less prone to errors from hardcoded strings.
243-249
: Better error handling for unexpected arrow filesThe new implementation properly handles cases where
arrow_path_to_parquet
returnsNone
, adding helpful warning logs. This improves robustness and debuggability when encountering unexpected file formats.
1196-1284
:✅ Verification successful
Comprehensive test coverage for the new functionality
The tests thoroughly cover various scenarios including valid paths, invalid paths, complex directory structures, and edge cases. This helps ensure the reliability of the new implementation.
Verify filename format consistency with these tests:
🏁 Script executed:
#!/bin/bash # Check if there are any existing arrow filename patterns in the codebase # Look for regex patterns that might already be defined for arrow filenames echo "Checking for existing regex patterns for arrow filenames:" rg -A 2 "Regex::new|lazy_static.*Regex|regex!|static.*Regex" --type rust # Look for other string operations on arrow filenames echo "Checking for other string operations on arrow filenames:" rg -A 2 "\.split|\.parse|\.contains.*arrow" --type rustLength of output: 15295
Approval: Arrow Filename Conversion Tests Provide Comprehensive Coverage and Maintain Consistent Format
The test suite in
src/parseable/streams.rs
thoroughly validates the arrow file conversion logic by:
- Confirming that valid arrow filenames are correctly transformed into parquet filenames using the supplied random string.
- Ensuring that edge cases—such as invalid paths, schema keys with disallowed characters, empty front parts, and files within nested directories—are properly handled.
- Verifying that the filename format in the conversion function (i.e., splitting on the first dot and appending the new suffix) is consistent with the behavior expected by the tests.
Additionally, our search of the codebase reveals no conflicting regex patterns or alternative string manipulation methods for arrow filenames. This confirms that the current implementation is consistent and in line with the repository’s overall filename handling patterns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/parseable/streams.rs (1)
70-80
:⚠️ Potential issueMismatch between PR objective and implementation
The PR mentions using regex for processing complex filenames, but the function
arrow_path_to_parquet
still uses string operations (split_once
) instead of regular expressions. Additionally, the assertion on line 74 assumes the front part will contain a delimiter, which could cause a panic if not present.Consider implementing the regex approach as mentioned in the PR objectives:
/// Returns the filename for parquet if provided arrows file path is valid as per our expectation fn arrow_path_to_parquet(path: &Path, random_string: &str) -> Option<PathBuf> { let filename = path.file_stem()?.to_str()?; - let (_, front) = filename.split_once('.')?; - assert!(front.contains('.'), "contains the delim `.`"); + // Use regex to extract parts after the schema key + let re = Regex::new(r"^[^.]+\.(.+)$").ok()?; + let front = re.captures(filename)?.get(1)?.as_str(); let filename_with_random_number = format!("{front}.data.{random_string}.parquet"); let mut parquet_path = path.to_owned(); parquet_path.set_file_name(filename_with_random_number); Some(parquet_path) }
🧹 Nitpick comments (1)
src/parseable/streams.rs (1)
1196-1284
: Comprehensive test coverage for the new functionalityThe tests thoroughly cover various scenarios including valid paths, invalid paths, complex paths, and edge cases. This will help maintain the functionality over time.
One improvement to consider - add a test that specifically validates the regex pattern (when implemented) against more diverse and complex filenames that might appear in production.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/parseable/streams.rs
(5 hunks)
🔇 Additional comments (2)
src/parseable/streams.rs (2)
197-200
: LGTM! Improved code maintainability with constant usageUsing the constant
ARROW_FILE_EXTENSION
instead of a hardcoded string improves maintainability and consistency.
243-250
: Improved error handling for unexpected arrow filesThe added else branch properly logs a warning when an unexpected arrow file is encountered, enhancing error visibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/parseable/streams.rs (1)
248-249
: Warn-only approach for unexpected Arrow files.
The code logs a warning for files that don’t matcharrow_path_to_parquet
. If unrecoverable conditions are possible, consider surfacing them with a stronger error or by removing the file.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/parseable/streams.rs
(5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: coverage
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (7)
src/parseable/streams.rs (7)
197-200
: Conditional check looks good.
Usingis_some_and
here is concise. Just confirm your Rust version is ≥1.63 if you rely on this method.
243-244
: Good handling when the path conversion succeeds.
This code cleanly associates valid Parquet keys with their Arrow files.
1215-1222
: Helper function for test file creation is clear.
The dummy content writing approach is straightforward and fits well with test needs.
1224-1239
: Valid Arrow path test.
This test verifies that the function returns a correct.parquet
filename. Looks good.
1241-1252
: Potential mismatch with function logic.
This test expectsNone
for missing.data.arrows
suffix, but the function only checks for the.arrows
extension and a split with.
. Verify that this test can pass with the current implementation.
1254-1265
: Invalid schema key scenario may not returnNone
.
The function does not reject special characters in front parts. Please verify if the test passes under real usage.
1267-1290
: Complex path handling test is comprehensive.
It validates deep directory structures and multiple delimiters. This aligns well with your function’s current logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
♻️ Duplicate comments (1)
src/parseable/streams.rs (1)
70-80
: 🛠️ Refactor suggestionPossible panic with assert! contradicts Option return type pattern
The function correctly uses
?
operators to propagate errors viaOption
, but then usesassert!
on line 74 which would cause a panic iffront
doesn't contain a dot. This contradicts the fallible nature of the function and the pattern established with the return type.The panic condition would interrupt normal program flow rather than allowing the caller to handle the error case gracefully through the
Option
return type.Replace the assertion with a fallible check:
fn arrow_path_to_parquet(path: &Path, random_string: &str) -> Option<PathBuf> { let filename = path.file_stem()?.to_str()?; let (_, front) = filename.split_once('.')?; - assert!(front.contains('.'), "contains the delim `.`"); + if !front.contains('.') { + return None; + } let filename_with_random_number = format!("{front}.{random_string}.parquet"); let mut parquet_path = path.to_owned(); parquet_path.set_file_name(filename_with_random_number); Some(parquet_path) }
🧹 Nitpick comments (1)
src/parseable/streams.rs (1)
1241-1264
: Test uses complex path but doesn't verify full path correctnessThe test validates only the filename portion of the result, but doesn't verify that the directory structure is preserved correctly.
Add an assertion to verify the directory structure is preserved:
assert!(result.is_some()); let parquet_path = result.unwrap(); assert_eq!( parquet_path.file_name().unwrap().to_str().unwrap(), "date=2020-01-21.hour=10.minute=30.region=us-west.ee529ffc8e76.data.random456.parquet" ); +// Verify that the directory structure is preserved +assert_eq!( + parquet_path.parent().unwrap(), + nested_dir +);
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/parseable/streams.rs
(5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: coverage
🔇 Additional comments (3)
src/parseable/streams.rs (3)
197-200
: Good use of constant instead of hardcoded extensionThe change to use the
ARROW_FILE_EXTENSION
constant instead of a hardcoded string value improves code maintainability.
243-250
: Improved error handling with warning for unexpected arrow filesAdding a warning log when encountering unexpected arrow files increases robustness and visibility of potential issues.
1215-1222
: Well-structured test helper functionThe
create_test_file
function is a good abstraction for test setup, making the test cases cleaner and more maintainable.
#[test] | ||
fn test_valid_arrow_path_conversion() { | ||
let temp_dir = TempDir::new().expect("Failed to create temp dir"); | ||
let filename = "12345abcde&key1=value1.date=2020-01-21.hour=10.minute=30.key1=value1.key2=value2.ee529ffc8e76.data.arrows"; | ||
let file_path = create_test_file(&temp_dir, filename); | ||
let random_string = "random123"; | ||
|
||
let result = arrow_path_to_parquet(&file_path, random_string); | ||
|
||
assert!(result.is_some()); | ||
let parquet_path = result.unwrap(); | ||
assert_eq!( | ||
parquet_path.file_name().unwrap().to_str().unwrap(), | ||
"date=2020-01-21.hour=10.minute=30.key1=value1.key2=value2.ee529ffc8e76.data.random123.parquet" | ||
); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test doesn't check filename structure expectation properly
The expected output filename in test_valid_arrow_path_conversion
is inconsistent with the current implementation of arrow_path_to_parquet
. The test assumes a format that includes the "data" component which would be created by format!("{front}.data.{random_string}.parquet")
, but the implementation uses format!("{front}.{random_string}.parquet")
.
Either update the test expectation or change the implementation to match. If you change the implementation, be sure to update line 75 as suggested above.
#[test] | ||
fn test_valid_arrow_path_conversion() { | ||
let temp_dir = TempDir::new().expect("Failed to create temp dir"); | ||
let filename = "12345abcde&key1=value1.date=2020-01-21.hour=10.minute=30.key1=value1.key2=value2.ee529ffc8e76.data.arrows"; | ||
let file_path = create_test_file(&temp_dir, filename); | ||
let random_string = "random123"; | ||
|
||
let result = arrow_path_to_parquet(&file_path, random_string); | ||
|
||
assert!(result.is_some()); | ||
let parquet_path = result.unwrap(); | ||
assert_eq!( | ||
parquet_path.file_name().unwrap().to_str().unwrap(), | ||
"date=2020-01-21.hour=10.minute=30.key1=value1.key2=value2.ee529ffc8e76.data.random123.parquet" | ||
); | ||
} | ||
|
||
#[test] | ||
fn test_complex_path() { | ||
let temp_dir = TempDir::new().expect("Failed to create temp dir"); | ||
let nested_dir = temp_dir.path().join("nested/directory/structure"); | ||
std::fs::create_dir_all(&nested_dir).expect("Failed to create nested directories"); | ||
|
||
let filename = "20200201T1830f8a5fc1edc567d56&key1=value1&key2=value2.date=2020-01-21.hour=10.minute=30.region=us-west.ee529ffc8e76.data.arrows"; | ||
let file_path = nested_dir.join(filename); | ||
|
||
let mut file = File::create(&file_path).expect("Failed to create test file"); | ||
file.write_all(b"test content") | ||
.expect("Failed to write to test file"); | ||
|
||
let random_string = "random456"; | ||
|
||
let result = arrow_path_to_parquet(&file_path, random_string); | ||
|
||
assert!(result.is_some()); | ||
let parquet_path = result.unwrap(); | ||
assert_eq!( | ||
parquet_path.file_name().unwrap().to_str().unwrap(), | ||
"date=2020-01-21.hour=10.minute=30.region=us-west.ee529ffc8e76.data.random456.parquet" | ||
); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add negative test cases for arrow_path_to_parquet function
While you've added tests for valid paths, there are no tests for invalid paths where the function should return None
.
Add tests for cases where the function should return None
:
#[test]
fn test_invalid_arrow_path_without_extension() {
let temp_dir = TempDir::new().expect("Failed to create temp dir");
let filename = "invalid_filename_without_extension";
let file_path = create_test_file(&temp_dir, filename);
let random_string = "random123";
let result = arrow_path_to_parquet(&file_path, random_string);
assert!(result.is_none());
}
#[test]
fn test_invalid_arrow_path_without_delimiters() {
let temp_dir = TempDir::new().expect("Failed to create temp dir");
let filename = "invalid_filename.arrows";
let file_path = create_test_file(&temp_dir, filename);
let random_string = "random123";
let result = arrow_path_to_parquet(&file_path, random_string);
assert!(result.is_none());
}
#[test]
fn test_invalid_arrow_path_without_second_delimiter() {
let temp_dir = TempDir::new().expect("Failed to create temp dir");
let filename = "schema.nodelimiter.arrows";
let file_path = create_test_file(&temp_dir, filename);
let random_string = "random123";
// This would fail with the current implementation due to the assert!
// but should return None if the assert is replaced with a check
let result = arrow_path_to_parquet(&file_path, random_string);
assert!(result.is_none());
}
/// Returns the filename for parquet if provided arrows file path is valid as per our expectation | ||
fn arrow_path_to_parquet(path: &Path, random_string: &str) -> Option<PathBuf> { | ||
let filename = path.file_stem()?.to_str()?; | ||
let (_, front) = filename.split_once('.')?; | ||
assert!(front.contains('.'), "contains the delim `.`"); | ||
let filename_with_random_number = format!("{front}.{random_string}.parquet"); | ||
let mut parquet_path = path.to_owned(); | ||
parquet_path.set_file_name(filename_with_random_number); | ||
|
||
Some(parquet_path) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Format string doesn't match expected parquet file format
The format string used for constructing the parquet filename on line 75 doesn't include the 'data' portion that's present in the arrow filename, but the tests expect it to be included.
Update the format string to maintain the same structure:
fn arrow_path_to_parquet(path: &Path, random_string: &str) -> Option<PathBuf> {
let filename = path.file_stem()?.to_str()?;
let (_, front) = filename.split_once('.')?;
assert!(front.contains('.'), "contains the delim `.`");
- let filename_with_random_number = format!("{front}.{random_string}.parquet");
+ let filename_with_random_number = format!("{front}.data.{random_string}.parquet");
let mut parquet_path = path.to_owned();
parquet_path.set_file_name(filename_with_random_number);
Some(parquet_path)
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
/// Returns the filename for parquet if provided arrows file path is valid as per our expectation | |
fn arrow_path_to_parquet(path: &Path, random_string: &str) -> Option<PathBuf> { | |
let filename = path.file_stem()?.to_str()?; | |
let (_, front) = filename.split_once('.')?; | |
assert!(front.contains('.'), "contains the delim `.`"); | |
let filename_with_random_number = format!("{front}.{random_string}.parquet"); | |
let mut parquet_path = path.to_owned(); | |
parquet_path.set_file_name(filename_with_random_number); | |
Some(parquet_path) | |
} | |
/// Returns the filename for parquet if provided arrows file path is valid as per our expectation | |
fn arrow_path_to_parquet(path: &Path, random_string: &str) -> Option<PathBuf> { | |
let filename = path.file_stem()?.to_str()?; | |
let (_, front) = filename.split_once('.')?; | |
assert!(front.contains('.'), "contains the delim `.`"); | |
let filename_with_random_number = format!("{front}.data.{random_string}.parquet"); | |
let mut parquet_path = path.to_owned(); | |
parquet_path.set_file_name(filename_with_random_number); | |
Some(parquet_path) | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
front already carries ".data" at the end as asserted above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@de-sh You're absolutely right. After tracing through the function with examples from the tests, I can see that "front" already contains ".data" at the end.
For example, when processing a file named "abc.date=2023-10-01.hour=12.minute=30.hostname.data.arrows":
- file_stem → "abc.date=2023-10-01.hour=12.minute=30.hostname.data"
- After split_once('.') → front = "date=2023-10-01.hour=12.minute=30.hostname.data"
- Result with format → "date=2023-10-01.hour=12.minute=30.hostname.data.{random_string}.parquet"
The original implementation is correct. Thanks for the clarification!
✏️ Learnings added
Learnt from: de-sh
PR: parseablehq/parseable#1239
File: src/parseable/streams.rs:70-80
Timestamp: 2025-03-24T06:08:47.139Z
Learning: In the Parseable codebase, arrow filenames follow a structure ending with ".data.arrows". When converting to parquet, the "front" part (after the first dot) already includes ".data" at the end, so there's no need to add it explicitly in the arrow_path_to_parquet function.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good to merge
Fixes #XXXX.
Description
Adds tests to validate working of the
arrow_path_to_parquet
functionThis PR has:
Summary by CodeRabbit
Summary by CodeRabbit